Discussion:
[pve-devel] corosync problems - need help
Alexandre DERUMIER
2014-09-14 06:18:09 UTC
Hi,

I have a corosync problem on my production cluster,
and I don't know how to debug it.



The cluster has 12 nodes, and multicast is working fine.
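(Multicast between nodes can be sanity-checked with omping, run simultaneously on every node; the host list below just uses this cluster's names:

# run this on each node at the same time
omping kvm1 kvm2 kvm3 kvm4 kvm5 kvm6 kvm7 kvm8 kvm9 kvm10 kvm11 kvm12

Each node should report both unicast and multicast responses from all the others.)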


On this cluster, 2 nodes show "corosync [TOTEM ] Retransmit List" messages.


All nodes are shown:
------------------
# cman_tool nodes
Node Sts Inc Joined Name
1 M 76636 2014-09-08 12:23:04 kvm6
2 M 76636 2014-09-08 12:23:04 kvm4
3 M 76636 2014-09-08 12:23:04 kvm3
4 M 76636 2014-09-08 12:23:04 kvm2
5 M 76636 2014-09-08 12:23:04 kvm5
6 M 76672 2014-09-12 16:52:08 kvm1
7 M 76636 2014-09-08 12:23:04 kvm8
8 M 76636 2014-09-08 12:23:04 kvm7
9 M 76636 2014-09-08 12:23:04 kvm9
10 M 76636 2014-09-08 12:23:04 kvm10
11 M 76944 2014-09-14 08:08:18 kvm11
12 M 4 2014-09-03 06:57:27 kvm12


I have quorum
--------------
# cman_tool status
Version: 6.2.0
Config Version: 12
Cluster Name: odiso
Cluster Id: 3337
Cluster Member: Yes
Cluster Generation: 76944
Membership state: Cluster-Member
Nodes: 12
Expected votes: 12
Total votes: 12
Node votes: 1
Quorum: 7
Active subsystems: 5
Flags:
Ports Bound: 0
Node name: kvm12
Node ID: 12
Multicast addresses: 239.192.13.22
Node addresses: 10.3.94.59




But I can't write anything in pmxcfs on any node (read is OK),

with a lot of errors like this:
kvm1 pmxcfs[65403]: [dcdb] notice: cpg_join retry 32310
kvm1 pmxcfs[65403]: [dcdb] notice: cpg_join retry 32320
kvm1 pmxcfs[65403]: [dcdb] notice: cpg_join retry 32330



Any idea?

I would like to try changing the corosync window_size, but how can I do that online?

(and /etc/init.d/cman stop is hanging)
Dietmar Maurer
2014-09-14 06:41:09 UTC
Post by Alexandre DERUMIER
On this cluster, 2 nodes show "corosync [TOTEM ] Retransmit List" messages.
What kernel do you run? 2.6.32 or 3.10.0?
What is different on those nodes? kernel, network cards?
Post by Alexandre DERUMIER
But I can't write anything in pmxcfs on any node (read is OK),
with a lot of errors like this:
kvm1 pmxcfs[65403]: [dcdb] notice: cpg_join retry 32310
kvm1 pmxcfs[65403]: [dcdb] notice: cpg_join retry 32320
kvm1 pmxcfs[65403]: [dcdb] notice: cpg_join retry 32330
Any idea?
Does it help if you restart the cluster file system:

# service pve-cluster restart

Note: You also need to restart the dependent services afterwards:

# service pvedaemon restart
# service pveproxy restart
# service pvestatd restart
Post by Alexandre DERUMIER
I would like to try changing the corosync window_size, but how can I do that online?
On all nodes?
Post by Alexandre DERUMIER
(and /etc/init.d/cman stop is hanging)
I guess you already tried to reboot that node?
Alexandre DERUMIER
2014-09-14 07:05:45 UTC
Post by Dietmar Maurer
What kernel do you run? 2.6.32 or 3.10.0?
One node runs 2.6.32, the other 3.10.


Post by Dietmar Maurer
What is different on those nodes? kernel, network cards?

All nodes are the same model, but I have 3 nodes on kernel 3.10 and 8 nodes on the 2.6.32 kernel.
(I'm currently migrating all nodes to 3.10.)

I added 2 nodes (kvm11, kvm12) with the 3.10 kernel a week ago (without any multicast problems).
Post by Dietmar Maurer
Post by Alexandre DERUMIER
I would like to try changing the corosync window_size, but how can I do that online?
On all nodes?
Yes, if possible. As I can't edit cluster.conf (it's read-only), I don't know how to inject it online.
Post by Dietmar Maurer
Post by Alexandre DERUMIER
(and /etc/init.d/cman stop is hanging)
I guess you already tried to reboot that node?
I can't reboot it for now; it's a production node, and I can't live-migrate the VMs because pmxcfs is read-only.


I'll try restarting all services on all nodes to see if it helps.

Dietmar Maurer
2014-09-14 07:27:22 UTC
Post by Alexandre DERUMIER
I would like to try changing the corosync window_size, but how can I do that online?
Post by Dietmar Maurer
On all nodes?
I meant: Is it read-only on all nodes?
Post by Alexandre DERUMIER
Yes, if possible. As I can't edit cluster.conf (it's read-only), I don't know how to inject it online.
If you edit /etc/cluster/cluster.conf, you need to increase the version number to prevent it from being overwritten. Then restart cman.

If there are still working nodes where /etc/pve is writable, edit it there.
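For example, something like this (only a sketch; the window_size value is illustrative, and config_version must be one higher than your current one):

# in /etc/cluster/cluster.conf, bump config_version and add a totem line:
#   <cluster name="odiso" config_version="13">
#     <totem window_size="50"/>
#     ...
#   </cluster>
# then activate the new config version:
cman_tool version -r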
Post by Alexandre DERUMIER
(and /etc/init.d/cman stop is hanging)
if nothing helps, try 'kill -9 ...'
Alexandre DERUMIER
2014-09-14 07:17:57 UTC
Post by Dietmar Maurer
# service pve-cluster restart
# service pvedaemon restart
# service pveproxy restart
# service pvestatd restart
That doesn't help.

Another strange thing is that tcpdump shows multicast traffic on port 5054 only from the 2 flooding nodes with retransmits.

All the other nodes don't seem to be sending anything.
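A capture along these lines shows it (eth0 stands for whatever interface carries the cluster traffic):

# watch corosync multicast traffic to the cluster address
tcpdump -n -i eth0 udp and host 239.192.13.22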




Alexandre DERUMIER
2014-09-14 08:10:01 UTC
Post by Dietmar Maurer
I meant: Is it read-only on all nodes?
Yes :(
Post by Dietmar Maurer
If you edit /etc/cluster/cluster.conf, you need to increase the version number to prevent it from being overwritten. Then restart cman.
OK, I'll try that.

It seems we can reload cluster.conf with "cman_tool version -r".
Post by Dietmar Maurer
Post by Alexandre DERUMIER
(and /etc/init.d/cman stop is hanging)
if nothing helps, try 'kill -9 ...'
Yes, it's hanging at:
# Stopping cluster:
#    Stopping dlm_controld...

I'll test increasing window_size.

I'll keep you posted.

(thanks for the help)


Alexandre DERUMIER
2014-09-14 08:34:36 UTC
Another strange thing: I have stopped 1 node, and the other nodes still see it as online!

# clustat


Member Name ID Status
------ ---- ---- ------
kvm6 1 Online
kvm4 2 Online
kvm3 3 Online
kvm2 4 Online
kvm5 5 Online
kvm1 6 Online ---> this node is shut down
kvm8 7 Online
kvm7 8 Online
kvm9 9 Online
kvm10 10 Online
kvm11 11 Online, Local
kvm12 12 Online


Alexandre DERUMIER
2014-09-14 08:52:50 UTC
I have restarted 2 nodes;
they see each other, but not the other nodes.

I think corosync is completely hung on the other nodes; they haven't seen the 2 restarted nodes at all.


Now I'll try to find a way to restart corosync without restarting the full node.

(The main problem is dlm_controld; I'm not sure I can kill it.)


Alexandre DERUMIER
2014-09-14 09:06:51 UTC
OK, I finally solved it:

# killall -9 dlm_controld
# killall -9 corosync
# service cman start


Now all is working fine again.

Thanks for the help!




Dietmar Maurer
2014-09-14 10:53:45 UTC
Post by Alexandre DERUMIER
OK, I finally solved it:
# killall -9 dlm_controld
# killall -9 corosync
# service cman start
Now all is working fine again.
I am curious - did you do that on all nodes, or only on the 2 failing nodes?
Alexandre DERUMIER
2014-09-14 13:41:26 UTC
Post by Dietmar Maurer
I am curious - did you do that on all nodes, or only on the 2 failing nodes?
Yes, I needed to do it on all nodes.



I have done more investigation, and now I can reproduce the problem 100%.

The problem seems to come from one specific node: kvm11.

When I start cman on this node, I get:

pmxcfs[31484]: [status] notice: cpg_send_message retry XX

on all the other nodes.

It's the same hardware as the other nodes; I need to check the network layer.


On the faulty node, I also see some pmxcfs segfaults in dmesg:

[976776.602200] pmxcfs[3130]: segfault at 7ff1dcadef08 ip 00007ff1dcadef08 sp 00007fffd89cfe68 error 15
[977517.260211] pmxcfs[4947]: segfault at 1956b00 ip 0000000001956b00 sp 00007ffff3b109e8 error 15
[980494.722550] pmxcfs[15205]: segfault at 7f712457ef08 ip 00007f712457ef08 sp 00007fff4a916668 error 15



Stefan Priebe - Profihost AG
2014-09-14 14:06:07 UTC
Memory defect?

Stefan

Excuse my typo sent from my mobile phone.
Alexandre DERUMIER
2014-09-14 14:11:58 UTC
Note that the corosync layer seems to be fine.

When cman starts on the faulty node, I see the member join in the corosync.log of the other nodes,

and then this starts:

Sep 14 15:49:47 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 10
Sep 14 15:49:48 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 20
Sep 14 15:49:49 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 30
Sep 14 15:49:50 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 40
Sep 14 15:49:51 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 50
Sep 14 15:49:52 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 60
Sep 14 15:49:53 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 70
Sep 14 15:49:54 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 80
Sep 14 15:49:55 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 90
Sep 14 15:49:56 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retry 100


Then, after killing corosync on the faulty node, it works again:

Sep 14 15:49:56 kvm1 pmxcfs[31484]: [status] notice: cpg_send_message retried 100 times


The code path seems to be in data/src/dfsm.c:

result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len);
if (retry && result == CPG_ERR_TRY_AGAIN) {
    nanosleep(&tvreq, NULL);
    ++retries;
    if ((retries % 10) == 0)
        cfs_dom_message(dfsm->log_domain, "cpg_send_message retry %d", retries);
    if (retries < 100)
        goto loop;
}

if (retries)
    cfs_dom_message(dfsm->log_domain, "cpg_send_message retried %d times", retries);
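For reference, here is a minimal standalone sketch against the same CPG API the snippet above uses, showing the same TRY_AGAIN loop; the names follow that snippet, and the group name and sleep interval are made up, not what pmxcfs uses:

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <corosync/cpg.h>

int main(void)
{
    cpg_handle_t handle;
    cpg_callbacks_t callbacks = { NULL, NULL };  /* no deliver/confchg callbacks needed here */
    struct cpg_name group;
    struct timespec tv = { 0, 100000000 };       /* 100 ms between retries */
    int retries = 0;
    int result;

    if (cpg_initialize(&handle, &callbacks) != CPG_OK)
        return 1;

    strcpy(group.value, "test_group");           /* hypothetical group name */
    group.length = strlen(group.value);

loop:
    result = cpg_join(handle, &group);
    if (result == CPG_ERR_TRY_AGAIN) {           /* corosync busy / flow-controlled */
        nanosleep(&tv, NULL);
        ++retries;
        goto loop;
    }

    printf("cpg_join returned %d after %d retries\n", result, retries);
    cpg_finalize(handle);
    return 0;
}

On a healthy cluster this joins immediately; on the hanging nodes it would loop on CPG_ERR_TRY_AGAIN forever, which matches what the pmxcfs log shows.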




Dietmar Maurer
2014-09-15 03:43:56 UTC
Post by Alexandre DERUMIER
data/src/dfsm.c
result = cpg_mcast_joined(dfsm->cpg_handle, CPG_TYPE_AGREED, iov, len);
if (retry && result == CPG_ERR_TRY_AGAIN) {
This just indicates that corosync does not work as expected.
Alexandre DERUMIER
2014-09-15 05:06:40 UTC
Post by Dietmar Maurer
This just indicates that corosync does not work as expected.
My understanding is that the faulty node joins the multicast group, and the others see it.

But when the other nodes try to talk to it, they get no response?



I'm going to capture some Wireshark network traces today.

I'll also try to update all the other nodes to kernel 3.10 (not sure it's related).


Alexandre DERUMIER
2014-09-15 05:26:52 UTC
Also, about the pmxcfs segfaults:

I have seen these messages,

Sep 14 09:06:33 kvm1 pmxcfs[65403]: [dcdb] notice: cpg_join retry 62840

Sep 14 10:57:25 kvm11 pmxcfs[13112]: [dcdb] notice: cpg_join retry 65090

with the retry count around 65000 (16 bits).



And the code in question:

int retries = 0;
result = cpg_join(dfsm->cpg_handle, &dfsm->cpg_group_name);
if (result == CPG_ERR_TRY_AGAIN) {
    nanosleep(&tvreq, NULL);
    ++retries;
    if ((retries % 10) == 0)
        cfs_dom_message(dfsm->log_domain, "cpg_join retry %d", retries);
    goto loop;
}


Could it be related to the type of the retries integer?



Dietmar Maurer
2014-09-16 05:51:07 UTC
Post by Alexandre DERUMIER
with the retry count around 65000 (16 bits).
And the code in question:
int retries = 0;
result = cpg_join(dfsm->cpg_handle, &dfsm->cpg_group_name);
if (result == CPG_ERR_TRY_AGAIN) {
    nanosleep(&tvreq, NULL);
    ++retries;
    if ((retries % 10) == 0)
        cfs_dom_message(dfsm->log_domain, "cpg_join retry %d", retries);
    goto loop;
}
Could it be related to the type of the retries integer?
First, int is 32 bit. Second, integer overflow does not raise an exception in C. So that cannot be the reason.
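A quick illustration of that point (and note the logged retry counts, around 65000, are nowhere near INT_MAX anyway):

#include <stdio.h>
#include <limits.h>

int main(void)
{
    int n = INT_MAX;
    n = n + 1;           /* signed overflow: undefined behaviour in C, no exception raised */
    printf("%d\n", n);   /* in practice it silently wraps, typically to INT_MIN */
    return 0;
}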
Alexandre DERUMIER
2014-09-15 16:58:38 UTC
Some news:

I forgot to say that I'm using Open vSwitch.

On the defective node, I see a lot of this in /var/log/openvswitch/ovs-vswitchd.log:

2014-09-15T15:44:07.536Z|77368|poll_loop|INFO|wakeup due to 0-ms timeout at ../ofproto/ofproto-dpif-upcall.c:253 (56% CPU usage)
2014-09-15T15:44:07.536Z|77369|poll_loop|INFO|wakeup due to [POLLIN] on fd 28 (FIFO pipe:[29855]) at ../lib/seq.c:157 (56% CPU usage)
2014-09-15T15:44:07.536Z|77370|poll_loop|INFO|wakeup due to 0-ms timeout at ../ofproto/ofproto-dpif-upcall.c:253 (56% CPU usage)
2014-09-15T15:44:07.537Z|77371|poll_loop|INFO|wakeup due to [POLLIN] on fd 28 (FIFO pipe:[29855]) at ../lib/seq.c:157 (56% CPU usage)
2014-09-15T15:44:10.535Z|77375|poll_loop|INFO|wakeup due to 0-ms timeout at ../ofproto/ofproto-dpif-upcall.c:253 (54% CPU usage)
2014-09-15T15:44:19.535Z|77379|poll_loop|INFO|wakeup due to [POLLIN] on fd 28 (FIFO pipe:[29855]) at ../lib/seq.c:157 (51% CPU usage)
2014-09-15T15:44:28.537Z|77385|poll_loop|INFO|wakeup due to [POLLIN] on fd 28 (FIFO pipe:[29855]) at ../lib/seq.c:157 (53% CPU usage)
2014-09-15T15:44:28.537Z|77386|poll_loop|INFO|wakeup due to 0-ms timeout at ../ofproto/ofproto-dpif-upcall.c:253 (53% CPU usage)
2014-09-15T15:44:34.535Z|77390|poll_loop|INFO|wakeup due to [POLLIN] on fd 28 (FIFO pipe:[29855]) at ../lib/seq.c:157 (52% CPU usage)


I'm not sure it's related, but the CPU usage of the ovs-vswitchd daemon is indeed high (50-70% of one core). (But I don't see any packet loss in the VMs or on the host.)

I found a patch about this:
http://git.openvswitch.org/cgi-bin/gitweb.cgi?p=openvswitch;a=commit;h=9b32ece62481706b0a340f7a100fe79ad9caad9e


It's possibly related to the number of taps/ports on the OVS bridge (I have a lot of them),

but it seems that fix is not yet in the current OVS 2.0.1.

So I'm going to test OVS 2.3 (it seems to work with the kernel 3.10 OVS module).



Alexandre DERUMIER
2014-09-16 06:33:56 UTC
Post by Dietmar Maurer
First, int is 32 bit. Second, integer overflow does not raise an exception in C.
So that cannot be the reason.
OK, sorry. (I thought of that because in the log I was seeing the counter increment up to around 65000, and then no more log entries.)


What I did yesterday:

- updated all nodes to the 3.10 kernel
- upgraded Open vSwitch to 2.3.0 (I had seen a high-CPU bug, and 2.3 fixes it)


But that didn't help.

I was able to bring this node back into the cluster for around 5 minutes, then it began to hang again.


Today, I'll try shutting down corosync on all servers,

then starting corosync on this node first and joining the other nodes.

(I want to be sure it's not because I have 2 more nodes in my cluster.)


I'll keep you posted.

Alexandre DERUMIER
2014-09-16 21:56:09 UTC
Some news:

I finally stopped/started the node (shutting down the VMs too :( ),

and it finally joined the cluster correctly.


So I really don't know what could have been hanging... Damned...


BTW, have you already had a look at corosync 2 + pacemaker? (That seems to be the supported model in RHEL 7.)

I know that pacemaker replaces rgmanager; I don't know whether corosync 2 would need a lot of changes in pmxcfs.



Dietmar Maurer
2014-09-17 08:54:41 UTC
Post by Alexandre DERUMIER
BTW, have you already had a look at corosync 2 + pacemaker? (That seems to be the supported model in RHEL 7.)
The problem with pacemaker is its complexity. IMHO it is totally unusable for most users.
For that reason, I am thinking of writing my own HA manager ...
Alexandre DERUMIER
2014-09-17 06:11:06 UTC
One last thing I haven't tested is updating libqb, which is really old on wheezy (0.11).

The latest version is 0.17,

and I have seen bug reports about corosync hanging because of libqb:

https://bugs.launchpad.net/ubuntu/+source/libqb/+bug/1341496


I'll try to backport the package from Debian sid.
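Roughly like this (a sketch only; exact source and binary package names on sid may differ):

# needs a deb-src entry for sid in /etc/apt/sources.list
apt-get update
apt-get build-dep libqb
apt-get source libqb
cd libqb-*
dpkg-buildpackage -us -uc
dpkg -i ../libqb0_*.deb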


Alexandre DERUMIER
2014-09-17 09:21:33 UTC
Post by Dietmar Maurer
The problem with pacemaker is its complexity. IMHO it is totally unusable for most users.
I'm using it on small clusters, but only for basic things (IP failover, or service failover).
With a big cluster and complex things like VM management, though, it's indeed not so easy.
Post by Dietmar Maurer
For that reason, I am thinking of writing my own HA manager ...
Great :)

(I personally don't use rgmanager and HA currently, because I'm always scared of these corosync problems.)



